Introduction

The aim of this study is to analyse the monthly residential energy consumption in the US, since January 2000. The analysis contains parts:

  • Data cleaning
  • Analysis of Missing Values
  • Visualizations
  • Univariate Time series Analysis.
  • Multivariate Time series Analysis

Data description

We consider data from the US Energy Information Administration’s online data browser.The data was downloaded from Table 2.2 of the US Energy Information Administration’s online data browser (https://www.eia.gov/totalenergy/data/browser/). The data set contains multiple time series, detailing:

  • Primary US residential energy consumption patterns per energy source (Fossil Fuels and Renewable Energy) from 1949 until 2021.
  • Electricity Retail Sales
  • Electrical System Energy Losses

The data includes 7,896 observations for each month of each year since 1949 with 6 columns as it can be seen in the table below.

MSN YYYYMM Value Column_Order Description Unit
CLRCBUS 194913 1271.551 1 Coal Consumed by the Residential Sector Trillion Btu
CLRCBUS 195013 1261.267 1 Coal Consumed by the Residential Sector Trillion Btu
CLRCBUS 195113 1158.679 1 Coal Consumed by the Residential Sector Trillion Btu
CLRCBUS 195213 1079.206 1 Coal Consumed by the Residential Sector Trillion Btu
CLRCBUS 195313 965.664 1 Coal Consumed by the Residential Sector Trillion Btu

In particular:

  • The Columns MSN, Column_order and Description are describing the the source of energy. (i.e. Coal, Natural Gas, Geothermal etc.)
  • All the values have the same units (Trillion BTU).

Data cleaning

Before processing the data, we separate the abbreviations (MSN) and descriptions (Description) into two data frames, one of which contains the overall values and the other of which contains all individual energy consumption per source. The two tables can be seen below:

##Extract codes and names
code<-data.frame(Detail=unique(orig_dat$MSN),Desc=unique(orig_dat$Description))

##Individual Values
ind<-code %>% 
  filter(!grepl('Total', Desc))

##Total Values 
tot<-code %>% 
  filter(grepl('Total', Desc))
Next, we look into the data type of each column.
x
MSN character
YYYYMM numeric
Value character
Column_Order numeric
Description character
Unit character

Looking closer at the data set, we observe that:

  • The data includes missing values that are labelled as “Non available.”
  • As we mentioned above, the columns description,column_order and MSN describe the same property, and as a result, we will drop them.
  • The majority of our data is classified as characters apart from the date and column order, which are classified as numbers.

We convert the original matrix into a data frame and drop the columns we mentioned above.We also convert the values that are set to “Not Available” to NA.

#Create new dataframe in order to preserve the original data
dat<-data.frame(orig_dat)
#Drop columns 
dat<-subset(orig_dat, select = -c(Unit,Description,Column_Order))
#Convert missing values
dat$Value[dat$Value=='Not Available']<-NA

The column containing the dates is set to the numeric data type. We convert it to characters, extract the year and month, and drop the original column. Next, we drop the rows containing data from 1949 to 1999, since we only interested on data containing information after 2000 . Finally, we observe that there is a month 13, which corresponds to the total energy consumption for each year.

##Convert to character
dat$YYYYMM<-as.character(dat$YYYYMM)
##Create new column containing the dates
dat <- dat %>%
  add_column(year = substr(dat$YYYYMM, start = 1, stop = 4),.after = "YYYYMM") 
dat <- dat %>%
  add_column(mon = substr(dat$YYYYMM, start = 5, stop = 6),.after = "year")  
##Drop original column
dat <- subset(dat, select = -YYYYMM )
## Drop rows with a date before january 2000
ec<-subset(dat,year>=2000)
## Drop rows with month 13 which corresponds to total
ec<-subset(ec,mon!=13)

Finally, we convert all columns to the correct data types.

#Combine dates in the right format
ec$year<-as.numeric(ec$year)
ec$mon<-as.numeric(ec$mon)
ec$Value<-as.numeric(ec$Value)
ec$Date <- as.yearmon(paste(ec$year, ec$mon), "%Y %m")
The final data frame can be seen below:
MSN year mon Value Date
CLRCBUS 2000 1 1.466 Jan 2000
CLRCBUS 2000 2 1.093 Feb 2000
CLRCBUS 2000 3 0.848 Mar 2000
CLRCBUS 2000 4 0.967 Apr 2000
CLRCBUS 2000 5 0.648 May 2000

Note: The last column may appear redundant, since we already have the year and month in the previous columns. We keep it as the format is easier for visualization purposes.

Missing Values

There are 166 missing values, as seen in the figure and table below. Those values are only missing from the data containing coal energy consumption. To make the visualization process easier, we dropped such values rather of using them.

#Find number a missing values per group
gg_miss_fct(ec, MSN) 

ec %>% group_by(MSN) %>% summarise(
  non_na = sum(!is.na(Value)),na = sum(is.na(Value)))
#Delete NAs and get final data frame
f.data <- na.omit(ec) 

Visualisations

Residential Sector Energy Consumption per source

Let’s start by visualizing the energy consumption for each energy source. It’s difficult to detect any distinct trends from the above graph. We observe natural gas is the most used energy source with significant variations between the winter and summer seasons. Additionally, we can observe that although there has been a little increasing tendency in electricity sales over time, volatility has remained constant. We’re going to visualize individually the consumption for fossil fuels and renewable energy.

Individual sources: Pie charts and Line graphs

Next to better see the amount of energy used by each source, we will create a pie chart. The tables created in the data cleaning phase can be used to interpret the abbreviations.

Finally, we’re going to visualise the mean energy consumed per individual energy source.

## `summarise()` has grouped output by 'year'. You can override using the `.groups`
## argument.

Summary statistics and box plots per individual source

##       Date         BMRCBUS         CLRCBUS          ESRCBUS     
##  Min.   :2000   Min.   :28.38   Min.   :0.3820   Min.   :260.1  
##  1st Qu.:2002   1st Qu.:32.26   1st Qu.:0.6240   1st Qu.:315.7  
##  Median :2004   Median :33.61   Median :0.7730   Median :361.1  
##  Mean   :2004   Mean   :33.44   Mean   :0.8529   Mean   :367.1  
##  3rd Qu.:2006   3rd Qu.:35.34   3rd Qu.:1.0283   3rd Qu.:410.3  
##  Max.   :2008   Max.   :36.52   Max.   :1.7910   Max.   :512.1  
##     GERCBUS         LORCBUS          NNRCBUS          PARCBUS      
##  Min.   :0.681   Min.   : 604.5   Min.   : 111.1   Min.   : 54.93  
##  1st Qu.:0.803   1st Qu.: 691.3   1st Qu.: 142.9   1st Qu.: 82.13  
##  Median :1.107   Median : 769.5   Median : 312.4   Median :107.41  
##  Mean   :1.161   Mean   : 804.0   Mean   : 410.9   Mean   :120.08  
##  3rd Qu.:1.363   3rd Qu.: 892.7   3rd Qu.: 692.9   3rd Qu.:156.50  
##  Max.   :1.868   Max.   :1129.9   Max.   :1001.5   Max.   :226.45  
##     SORCBUS     
##  Min.   :2.718  
##  1st Qu.:3.648  
##  Median :4.649  
##  Mean   :4.446  
##  3rd Qu.:5.190  
##  Max.   :5.984
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
## na.rm): NAs introduced by coercion
##        Date     BMRCBUS     CLRCBUS     ESRCBUS     GERCBUS     LORCBUS 
##          NA   1.9651206   0.3333607  61.7581942   0.3657290 136.0019233 
##     NNRCBUS     PARCBUS     SORCBUS 
## 286.4640758  45.3295816   0.9139452

Total Energy consumption

We are going to calculate the summary data for the total consumption and sales.The graphs below show the total energy consumed, the energy consumed in the recent years and the proportion of each one.

##     FFRCBUS          RERCBUS         TERCBUS        TXRCBUS      
##  Min.   : 134.5   Min.   :32.33   Min.   :1271   Min.   : 192.3  
##  1st Qu.: 202.5   1st Qu.:40.47   1st Qu.:1434   1st Qu.: 247.6  
##  Median : 354.4   Median :49.62   Median :1634   Median : 403.6  
##  Mean   : 503.3   Mean   :50.58   Mean   :1743   Mean   : 553.9  
##  3rd Qu.: 813.7   3rd Qu.:58.68   3rd Qu.:1972   3rd Qu.: 863.6  
##  Max.   :1224.4   Max.   :76.88   Max.   :2808   Max.   :1272.0

Univariate data analysis

For the univariate explaratory data analysis we choose to study the natural gas consumption. We create a time series with the values that we subseted from the dataset starting from 2000 with monthly frequency.
MSN year mon Value Date
NNRCBUS 2000 1 881.954 Jan 2000
NNRCBUS 2000 2 792.410 Feb 2000
NNRCBUS 2000 3 562.305 Mar 2000
NNRCBUS 2000 4 409.652 Apr 2000
NNRCBUS 2000 5 233.428 May 2000

Visualisation

Time series Analysis of Natural Gas consumption

Autocorrelation and Partial Correlation

In the figure below, it can be observed that the ACF is sinusoidal, which implies that the ACF is periodic. The peak recurs every 12 implies that the ACF has a period of 12 lags which when calculated is equivalent to a year.

The ACF is not approaching zero and thus it is most probably not stationary.It can also be observed that the majority of the ACF values are not within the bounds of statistical significance.

Linear Trend

On the figure below, we observe that there is a slight upward trend.

Seosonality

Time series after removing linear trend and seasonality

Multivariate Data

For the multivariate analysis we’re choosing to study the total fossil and renewable energy consumption.For this purpose we create a new dataframe containing the variables of interest and then we reshape it in order to make our analysis easier.

We can see from the below graphic that there is a significant disparity between the usage of fossil fuels and renewable energy sources. We will examine each one in more detail in the sections that follow.

ACF and PACF